18 Expectation



18.1 Definitions and Examples 



The expectation or expected value of a random variable is a single number that 
tells you a lot about the behavior of the variable. Roughly, the expectation is the 
average value of the random variable where each value is weighted according to its 
probability. Formally, the expected value (also known as the average or mean) of a 
random variable is defined as follows. 

Definition 18.1.1. If R is a random variable defined on a sample space S, then the
expectation of R is

\[ \mathrm{Ex}[R] ::= \sum_{\omega \in S} R(\omega) \Pr[\omega]. \tag{18.1} \]

For example, suppose S is the set of students in a class, and we select a student 
uniformly at random. Let R be the selected student's exam score. Then Ex[R] is 
just the class average — the first thing everyone wants to know after getting their test 
back! For similar reasons, the first thing you usually want to know about a random 
variable is its expected value. 

Let's work through some examples. 

18.1.1 The Expected Value of a Uniform Random Variable 

Let R be the value that comes up when you roll a fair 6-sided die. Then the expected
value of R is

\[ \mathrm{Ex}[R] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{7}{2}. \]

This calculation shows that the name "expected value" is a little misleading; the 
random variable might never actually take on that value. You don't ever expect to 
roll a 3½ on an ordinary die!

Also note that the mean of a random variable is not the same as the median. The 
median is the midpoint of a distribution. 

Definition 18.1.2. The median¹ of a random variable R is the value x ∈ range(R)
such that

\[ \Pr[R < x] \le \frac{1}{2} \quad \text{and} \quad \Pr[R > x] < \frac{1}{2}. \]

¹Some texts define the median to be the value of x ∈ range(R) for which
Pr[R < x] < 1/2 and Pr[R > x] ≤ 1/2. The difference in definitions is not important.

In this text, we will not devote much attention to the median. Rather, we will 
focus on the expected value, which is much more interesting and useful. 

Rolling a 6-sided die provides an example of a uniform random variable. In 
general, if R_n is a random variable with a uniform distribution on {1, 2, ..., n},
then

\[ \mathrm{Ex}[R_n] = \sum_{i=1}^{n} i \cdot \frac{1}{n} = \frac{n(n+1)}{2n} = \frac{n+1}{2}. \]
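As a quick check of the two calculations above, here is a minimal sketch that computes an expectation as a probability-weighted sum using exact fractions. The helper names `expectation` and `uniform_pmf` are illustrative choices, not notation from the text, and the sketch assumes the distribution is given as a finite table of values and probabilities.

```python
from fractions import Fraction

def expectation(pmf):
    """Return the sum of x * Pr[R = x] over a dict mapping value -> probability."""
    assert sum(pmf.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in pmf.items())

def uniform_pmf(n):
    """Uniform distribution on {1, 2, ..., n} (hypothetical helper)."""
    return {i: Fraction(1, n) for i in range(1, n + 1)}

die = uniform_pmf(6)
print(expectation(die))          # 7/2
print(expectation(die) in die)   # False: 7/2 is not a face of the die

# The closed form (n + 1)/2 agrees with the direct sum for every n tried here.
print(all(expectation(uniform_pmf(n)) == Fraction(n + 1, 2) for n in range(1, 50)))  # True
```

Using `Fraction` rather than floating point keeps the comparison with (n + 1)/2 exact.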

18.1.2 The Expected Value of an Indicator Random Variable 

The expected value of an indicator random variable for an event is just the
probability of that event.

Lemma 18.1.3. If I_A is the indicator random variable for event A, then

\[ \mathrm{Ex}[I_A] = \Pr[A]. \]

Proof.

\begin{align*}
\mathrm{Ex}[I_A] &= 1 \cdot \Pr[I_A = 1] + 0 \cdot \Pr[I_A = 0] \\
&= \Pr[I_A = 1] \\
&= \Pr[A]. && \text{(def of } I_A \text{)}
\end{align*}
■

For example, if A is the event that a coin with bias p comes up heads, then
Ex[I_A] = Pr[I_A = 1] = p.

18.1.3 Alternate Definitions 

There are several equivalent ways to define expectation. 

Theorem 18.1.4. If R is a random variable defined on a sample space S then

\[ \mathrm{Ex}[R] = \sum_{x \in \mathrm{range}(R)} x \cdot \Pr[R = x]. \tag{18.2} \]

The proof of Theorem 18.1.4, like many of the elementary proofs about expectation
in this chapter, follows by judicious regrouping of terms in Equation 18.1:



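Proof.

\begin{align*}
\mathrm{Ex}[R] &= \sum_{\omega \in S} R(\omega) \Pr[\omega] && \text{(Def 18.1.1 of expectation)} \\
&= \sum_{x \in \mathrm{range}(R)} \sum_{\omega \in [R = x]} R(\omega) \Pr[\omega] \\
&= \sum_{x \in \mathrm{range}(R)} \sum_{\omega \in [R = x]} x \Pr[\omega] && \text{(def of the event } [R = x] \text{)} \\
&= \sum_{x \in \mathrm{range}(R)} x \left( \sum_{\omega \in [R = x]} \Pr[\omega] \right) && \text{(distributing } x \text{ over the inner sum)} \\
&= \sum_{x \in \mathrm{range}(R)} x \cdot \Pr[R = x]. && \text{(def of } \Pr[R = x] \text{)}
\end{align*}

The second equality follows because the events [R = x] for x ∈ range(R) partition
the sample space S, so summing over the outcomes in [R = x] for x ∈ range(R)
is the same as summing over S. ■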

In general, Equation 18.2 is more useful than Equation 18.1 for calculating ex- 
pected values and has the advantage that it does not depend on the sample space, 
but only on the density function of the random variable. It is especially useful when 
the range of the random variable is N, as we will see from the following corollary. 

Corollary 18.1.5. If the range of a random variable R is N, then

\[ \mathrm{Ex}[R] = \sum_{i=1}^{\infty} i \Pr[R = i] = \sum_{i=0}^{\infty} \Pr[R > i]. \]

Proof. The first equality follows directly from Theorem 18.1.4 and the fact that
range(R) = N. The second equality is derived by adding the following equations:

\begin{align*}
\Pr[R > 0] &= \Pr[R = 1] + \Pr[R = 2] + \Pr[R = 3] + \cdots \\
\Pr[R > 1] &= \phantom{\Pr[R = 1] + {}} \Pr[R = 2] + \Pr[R = 3] + \cdots \\
\Pr[R > 2] &= \phantom{\Pr[R = 1] + \Pr[R = 2] + {}} \Pr[R = 3] + \cdots \\
&\ \ \vdots \\
\sum_{i=0}^{\infty} \Pr[R > i] &= 1 \cdot \Pr[R = 1] + 2 \cdot \Pr[R = 2] + 3 \cdot \Pr[R = 3] + \cdots \\
&= \sum_{i=1}^{\infty} i \Pr[R = i].
\end{align*}
■
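The two sums in Corollary 18.1.5 can be compared on a small example. The sketch below is only an illustration: the distribution `pmf` is a made-up example on {0, 1, 2, 3}, and `tail` is a hypothetical helper name for Pr[R > i].

```python
from fractions import Fraction

# Hypothetical example distribution on {0, 1, 2, 3}; any pmf on the naturals works.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def tail(pmf, i):
    """Pr[R > i]."""
    return sum(p for x, p in pmf.items() if x > i)

by_values = sum(x * p for x, p in pmf.items())              # sum of i * Pr[R = i]
by_tails = sum(tail(pmf, i) for i in range(max(pmf) + 1))   # sum of Pr[R > i]
print(by_values, by_tails)  # 3/2 3/2
```

Each value i contributes Pr[R = i] to exactly i of the tail probabilities, which is why the two totals agree.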

18.1.4 Mean Time to Failure 

The mean time to failure is a critical parameter in the design of almost any system.
For example, suppose that a computer program crashes at the end of each hour of
use with probability p if it has not crashed already. What is the expected time until
the program crashes?

If we let C be the number of hours until the crash, then the answer to our problem
is Ex[C]. C is a random variable with values in N and so we can use
Corollary 18.1.5 to determine that

\[ \mathrm{Ex}[C] = \sum_{i=0}^{\infty} \Pr[C > i]. \tag{18.3} \]

Pr[C > i] is easy to evaluate: a crash happens later than the ith hour iff the
system did not crash during the first i hours, which happens with probability
(1 − p)^i. Plugging this into Equation 18.3 gives:

\begin{align*}
\mathrm{Ex}[C] &= \sum_{i=0}^{\infty} (1 - p)^i \\
&= \frac{1}{1 - (1 - p)} && \text{(sum of geometric series)} \\
&= \frac{1}{p}. \tag{18.4}
\end{align*}

For example, if there is a 1% chance that the program crashes at the end of each
hour, then the expected time until the program crashes is 1/0.01 = 100 hours.
The general principle here is well worth remembering:

If a system fails at each time step with probability p, then the expected
number of steps up to (and including) the first failure is 1/p.
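The 1/p rule can be illustrated in two ways: by summing the geometric series numerically and by simulating crashes. This is a rough sketch rather than part of the derivation; the function names, the number of trials, and the seed are arbitrary assumptions.

```python
import random

def hours_until_crash(p, rng):
    """Simulate one run: count hours up to and including the first crash."""
    hours = 1
    while rng.random() >= p:   # survive this hour with probability 1 - p
        hours += 1
    return hours

def estimate_mttf(p, trials=20_000, seed=0):
    """Average crash time over many simulated runs (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return sum(hours_until_crash(p, rng) for _ in range(trials)) / trials

def partial_sum(p, terms):
    """Partial sum of the geometric series: (1 - p)^i for i = 0 .. terms - 1."""
    return sum((1 - p) ** i for i in range(terms))

p = 0.01
print(partial_sum(p, 5000))   # very close to 1/p = 100
print(estimate_mttf(p))       # a sample average, typically near 100
```

The simulated value varies with the seed; only the series sum is guaranteed to approach 1/p.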

Making Babies 

As a related example, suppose a couple really wants to have a baby girl. For sim- 
plicity, assume that there is a 50% chance that each child they have is a girl, and 
that the genders of their children are mutually independent. If the couple insists on 
having children until they get a girl, then how many baby boys should they expect 
first? 

The question, "How many hours until the program crashes?" is mathematically 
the same as the question, "How many children must the couple have until they 
get a girl?" In this case, a crash corresponds to having a girl, so we should set 
p = 1/2. By the preceding analysis, the couple should expect a baby girl after
having 1/p = 2 children. Since the last of these will be the girl, they should
expect just one boy. 

18.1.5 Dealing with Infinity 

The analysis of the mean time to failure was easy enough. But if you think about it 
further, you might start to wonder about the case when the computer program never 
fails. For example, what if the program runs forever? How do we handle outcomes 
with an infinite value? 

These are good questions and we wonder about them too. Indeed, mathemati- 
cians have gone to a lot of work to reason about sample spaces with an infinite 
number of outcomes or outcomes with infinite value. 

To keep matters simple in this text, we will follow the common convention of
ignoring the contribution of outcomes that have probability zero when computing
expected values. This means that we can safely ignore the "never-fail" outcome,
because it has probability

\[ \lim_{n \to \infty} (1 - p)^n = 0. \]

When we are computing expectations for infinite sample spaces, we will generally
focus our attention on a subset of outcomes that occur with collective
probability one. For the most part, this will allow us to ignore the "infinite"
outcomes because they will typically happen with probability zero.²

This assumption does not mean that the expected value of a random variable is 
always finite, however. Indeed, there are many examples where the expected value 
is infinite. And where infinity raises its ugly head, trouble is sure to follow. Let's 
see an example. 

18.1.6 Pitfall: Computing Expectations by Sampling 

Suppose that you are trying to estimate a parameter such as the average delay across 
a communication channel. So you set up an experiment to measure how long it 
takes to send a test packet from one end to the other and you run the experiment 
100 times. 

You record the latency, rounded to the nearest millisecond, for each of the hun- 
dred experiments, and then compute the average of the 100 measurements. Suppose 
that this average is 8.3 ms. 

Because you are careful, you repeat the entire process twice more and get
averages of 7.8 ms and 7.9 ms. You conclude that the average latency across the
channel is

\[ \frac{7.8 + 7.9 + 8.3}{3} = 8 \text{ ms}. \]

²If this still bothers you, you might consider taking a course on measure theory.

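You might be right but you might also be horribly wrong. In fact, the expected
latency might well be infinite. Here's how.

Let D be a random variable that denotes the time it takes for the packet to cross
the channel. Suppose that

\[ \Pr[D = i] = \begin{cases} 0 & \text{for } i = 0, \\ \dfrac{1}{i(i+1)} & \text{for } i \in \mathbb{N}^+. \end{cases} \tag{18.5} \]

It is easy to check that

\[ \sum_{i=1}^{\infty} \Pr[D = i] = \sum_{i=1}^{\infty} \left( \frac{1}{i} - \frac{1}{i+1} \right) = 1, \]

and so D is, in fact, a random variable.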

From Equation 18.5, we might expect that D is likely to be small. Indeed, D = 1 
with probability 1/2, D = 2 with probability 1/6, and so forth. So if we took 
100 samples of D, about 50 would be 1 ms, about 16 would be 2 ms, and very 
few would be large. In summary, it might well be the case that the average of the 
100 measurements would be under 10 ms, just as in our example. 

This sort of reasoning and the calculation of expected values by averaging ex- 
perimental values is very common in practice. It can easily lead to incorrect con- 
clusions, however. For example, using Corollary 18.1.5, we can quickly (and accu- 
rately) determine that 



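\begin{align*}
\mathrm{Ex}[D] &= \sum_{i=1}^{\infty} i \Pr[D = i] \\
&= \sum_{i=1}^{\infty} i \cdot \frac{1}{i(i+1)} \\
&= \sum_{i=1}^{\infty} \frac{1}{i+1} \\
&= \infty.
\end{align*}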



Uh-oh! The expected time to cross the communication channel is infinite! This
result is a far cry from the 8 ms average that we calculated. What went wrong?

It is true that most of the time, the value of D will be small. But sometimes
D will be very large and this happens with sufficient probability that the expected
value of D is unbounded. In fact, if you keep repeating the experiment, you are
likely to see some outcomes and averages that are much larger than 10 ms. In
practice, such "outliers" are sometimes discarded, which masks the true behavior
of D.

In general, the best way to compute an expected value in practice is to first use 
the experimental data to figure out the distribution as best you can, and then to use 
Theorem 18.1.4 or Corollary 18.1.5 to compute its expectation. This method will 
help you identify cases where the expectation is infinite, and will generally be more 
accurate than a simple averaging of the data. 
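The pitfall can be reproduced in a small experiment. The sketch below is an illustration only: it assumes Pr[D = i] = 1/(i(i + 1)) for i ≥ 1 as in Equation 18.5, and the names `pmf_D` and `sample_D`, the batch size, and the seed are hypothetical choices. Sampling uses the fact that floor(1/U) for U uniform on (0, 1] satisfies Pr[D > n] = 1/(n + 1).

```python
import random
from fractions import Fraction

def pmf_D(i):
    """Pr[D = i] for the latency distribution: 1/(i(i+1)) for i >= 1, else 0."""
    return Fraction(1, i * (i + 1)) if i >= 1 else Fraction(0)

def sample_D(rng):
    """Inverse-transform sample: floor(1/U) has Pr[D > n] = 1/(n + 1)."""
    u = 1.0 - rng.random()          # uniform on (0, 1], avoids division by zero
    return int(1.0 / u)

rng = random.Random(1)              # arbitrary seed, chosen only for repeatability
for batch in range(5):
    samples = [sample_D(rng) for _ in range(100)]
    print(f"batch {batch}: mean = {sum(samples) / 100:.1f}, max = {max(samples)}")

# Exact partial sums of i * Pr[D = i] keep growing (roughly like ln n).
for n in (10, 1_000, 100_000):
    print(n, sum(float(i * pmf_D(i)) for i in range(1, n + 1)))
```

Most batches give a modest mean, while the partial sums of the true expectation grow without bound, which is the mismatch described above.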

18.1.7 Conditional Expectation 

Just like event probabilities, expectations can be conditioned on some event. Given
a random variable R, the expected value of R conditioned on an event A is the
(probability-weighted) average value of R over outcomes in A. More formally:

Definition 18.1.6. The conditional expectation Ex[R | A] of a random variable R
given event A is:

\[ \mathrm{Ex}[R \mid A] ::= \sum_{r \in \mathrm{range}(R)} r \cdot \Pr[R = r \mid A]. \tag{18.6} \]

For example, we can compute the expected value of a roll of a fair die, given
that the number rolled is at least 4. We do this by letting R be the
outcome of a roll of the die. Then by Equation 18.6,

\[ \mathrm{Ex}[R \mid R \ge 4] = \sum_{i=1}^{6} i \cdot \Pr[R = i \mid R \ge 4] = 1 \cdot 0 + 2 \cdot 0 + 3 \cdot 0 + 4 \cdot \tfrac{1}{3} + 5 \cdot \tfrac{1}{3} + 6 \cdot \tfrac{1}{3} = 5. \]
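The die example can be reproduced with a few lines of code. This is a minimal sketch under a simplifying assumption: the conditioning event is described by a predicate on the value of R, which covers events such as "the roll is at least 4" but not arbitrary events on the sample space. The function name `conditional_expectation` is an illustrative choice.

```python
from fractions import Fraction

def conditional_expectation(pmf, event):
    """Ex[R | A] for an event A given as a predicate on the value of R."""
    pr_event = sum(p for r, p in pmf.items() if event(r))
    # Pr[R = r | A] is p / Pr[A] when r satisfies A, and 0 otherwise.
    return sum(r * p / pr_event for r, p in pmf.items() if event(r))

die = {r: Fraction(1, 6) for r in range(1, 7)}
print(conditional_expectation(die, lambda r: r >= 4))  # 5
```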

As another example, consider the channel latency problem from Section 18.1.6.
The expected latency for this problem was infinite. But what if we look at the
expected latency conditioned on the latency not exceeding n? Then

\begin{align*}
\mathrm{Ex}[D \mid D \le n] &= \sum_{i=1}^{\infty} i \Pr[D = i \mid D \le n] \\
&= \sum_{i=1}^{\infty} i \cdot \frac{\Pr[D = i \wedge D \le n]}{\Pr[D \le n]} \\
&= \sum_{i=1}^{n} \frac{i \Pr[D = i]}{\Pr[D \le n]} \\
&= \frac{1}{\Pr[D \le n]} \sum_{i=1}^{n} i \cdot \frac{1}{i(i+1)} \\
&= \frac{1}{\Pr[D \le n]} \sum_{i=1}^{n} \frac{1}{i+1} \\
&= \frac{1}{\Pr[D \le n]} (H_{n+1} - 1),
\end{align*}

where H_{n+1} is the (n + 1)st Harmonic number

\[ H_{n+1} = \ln(n+1) + \gamma + \frac{1}{2(n+1)} - \frac{1}{12(n+1)^2} + \frac{\epsilon(n)}{120(n+1)^4} \]

and 0 ≤ ε(n) ≤ 1. The second equality follows from the definition of conditional
probability, the third equality follows from the fact that Pr[D = i ∧ D ≤ n] = 0
for i > n (and equals Pr[D = i] for i ≤ n), and the fourth equality follows from
the definition of D in Equation 18.5. To compute Pr[D ≤ n], we observe that

\begin{align*}
\Pr[D \le n] &= 1 - \Pr[D > n] \\
&= 1 - \sum_{i=n+1}^{\infty} \frac{1}{i(i+1)} \\
&= 1 - \left( \left( \frac{1}{n+1} - \frac{1}{n+2} \right) + \left( \frac{1}{n+2} - \frac{1}{n+3} \right) + \left( \frac{1}{n+3} - \frac{1}{n+4} \right) + \cdots \right) \\
&= 1 - \frac{1}{n+1} \\
&= \frac{n}{n+1}.
\end{align*}

Hence,

\[ \mathrm{Ex}[D \mid D \le n] = \frac{n+1}{n} (H_{n+1} - 1). \tag{18.7} \]

For n = 1000, this is about 6.5. This explains why the expected value of D appears
to be finite when you try to evaluate it experimentally. If you compute 100 samples
of D, it is likely that all of them will be at most 1000 ms. If you condition on not
having any outcomes greater than 1000 ms, then the conditional expected value will
be about 6.5 ms, which would be a commonly observed result in practice. Yet we
know that Ex[D] is infinite. For this reason, expectations computed in practice are
often really just conditional expectations where the condition is that rare "outlier"
sample points are eliminated from the analysis.
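The figure of about 6.5 ms for n = 1000 can be checked exactly. The sketch below assumes Pr[D = i] = 1/(i(i + 1)) for i ≥ 1; the function names `truncated_mean` and `closed_form` are illustrative, not from the text.

```python
from fractions import Fraction

def truncated_mean(n):
    """Ex[D | D <= n], computed directly from Pr[D = i] = 1/(i(i+1))."""
    pr_le_n = sum(Fraction(1, i * (i + 1)) for i in range(1, n + 1))
    return sum(i * Fraction(1, i * (i + 1)) for i in range(1, n + 1)) / pr_le_n

def closed_form(n):
    """(n + 1)/n * (H_{n+1} - 1), with the Harmonic number summed exactly."""
    h = sum(Fraction(1, k) for k in range(1, n + 2))
    return Fraction(n + 1, n) * (h - 1)

print(truncated_mean(50) == closed_form(50))   # True
print(round(float(closed_form(1000)), 2))      # 6.49
```

The direct sum and the closed form agree exactly, and the value grows only logarithmically in the cutoff n.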

18.1.8 The Law of Total Expectation 

Another useful feature of conditional expectation is that it lets us divide compli- 
cated expectation calculations into simpler cases. We can then find the desired 
expectation by calculating the conditional expectation in each simple case and
averaging them, weighting each case by its probability.

For example, suppose that 49.8% of the people in the world are male and the 
rest female — which is more or less true. Also suppose the expected height of a 
randomly chosen male is 5' 11", while the expected height of a randomly chosen 
female is 5' 5". What is the expected height of a randomly chosen individual? We 
can calculate this by averaging the heights of men and women. Namely, let H be 
the height (in feet) of a randomly chosen person, and let M be the event that the 
person is male and F the event that the person is female. Then 

\begin{align*}
\mathrm{Ex}[H] &= \mathrm{Ex}[H \mid M] \Pr[M] + \mathrm{Ex}[H \mid F] \Pr[F] \\
&= (5 + 11/12) \cdot 0.498 + (5 + 5/12) \cdot 0.502 \\
&\approx 5.6657
\end{align*}

which is a little less than 5' 8". 
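The weighted average in this example is easy to reproduce. This is a minimal sketch; `total_expectation` is a hypothetical helper that takes one (conditional expectation, probability) pair per case and assumes the cases form a partition.

```python
def total_expectation(cases):
    """cases: list of (Ex[R | A_i], Pr[A_i]) pairs for a partition A_1, A_2, ..."""
    assert abs(sum(p for _, p in cases) - 1) < 1e-12, "cases must partition S"
    return sum(e * p for e, p in cases)

height = total_expectation([(5 + 11 / 12, 0.498),   # males: 5' 11"
                            (5 + 5 / 12, 0.502)])   # females: 5' 5"
print(round(height, 4))  # 5.6657
```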

This method is justified by the Law of Total Expectation. 

Theorem 18.1.7 (Law of Total Expectation). Let R be a random variable on a
sample space S and suppose that A_1, A_2, ... is a partition of S. Then

\[ \mathrm{Ex}[R] = \sum_{i} \mathrm{Ex}[R \mid A_i] \Pr[A_i]. \]

Proof.

\begin{align*}
\mathrm{Ex}[R] &= \sum_{r \in \mathrm{range}(R)} r \cdot \Pr[R = r] && \text{(Equation 18.2)} \\
&= \sum_{r} r \cdot \sum_{i} \Pr[R = r \mid A_i] \Pr[A_i] && \text{(Law of Total Probability)} \\
&= \sum_{r} \sum_{i} r \cdot \Pr[R = r \mid A_i] \Pr[A_i] && \text{(distribute constant } r \text{)} \\
&= \sum_{i} \sum_{r} r \cdot \Pr[R = r \mid A_i] \Pr[A_i] && \text{(exchange order of summation)} \\
&= \sum_{i} \Pr[A_i] \sum_{r} r \cdot \Pr[R = r \mid A_i] && \text{(factor constant } \Pr[A_i] \text{)} \\
&= \sum_{i} \Pr[A_i] \, \mathrm{Ex}[R \mid A_i]. && \text{(Def 18.1.6 of cond. expectation)}
\end{align*}
■



As a more interesting application of the Law of Total Expectation, let's take
another look at the mean time to failure of a system that fails with probability p at
each step. We'll define A to be the event that the system fails on the first step and
Ā to be the complementary event (namely, that the system does not fail on the first
step). Then the mean time to failure Ex[C] is

\[ \mathrm{Ex}[C] = \mathrm{Ex}[C \mid A] \Pr[A] + \mathrm{Ex}[C \mid \bar{A}] \Pr[\bar{A}]. \tag{18.8} \]

Since A is the condition that the system crashes on the first step, we know that

\[ \mathrm{Ex}[C \mid A] = 1. \tag{18.9} \]

Since Ā is the condition that the system does not crash on the first step, conditioning
on Ā is equivalent to taking a first step without failure and then starting over without
conditioning. Hence,

\[ \mathrm{Ex}[C \mid \bar{A}] = 1 + \mathrm{Ex}[C]. \tag{18.10} \]

Plugging Equations 18.9 and 18.10 into Equation 18.8, we find that

\begin{align*}
\mathrm{Ex}[C] &= 1 \cdot p + (1 + \mathrm{Ex}[C])(1 - p) \\
&= p + 1 - p + (1 - p) \mathrm{Ex}[C] \\
&= 1 + (1 - p) \mathrm{Ex}[C].
\end{align*}

Rearranging terms, we find that

\[ 1 = \mathrm{Ex}[C] - (1 - p) \mathrm{Ex}[C] = p \, \mathrm{Ex}[C], \]

and thus that

\[ \mathrm{Ex}[C] = \frac{1}{p}, \]

as expected.

We will use this sort of analysis extensively in Chapter 20 when we examine the 
expected behavior of random walks. 

18.1.9 Expectations of Functions 

Expectations can also be defined for functions of random variables. 

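Definition 18.1.8. Let R : S → V be a random variable and f : V → ℝ be a total
function on the range of R. Then

\[ \mathrm{Ex}[f(R)] = \sum_{\omega \in S} f(R(\omega)) \Pr[\omega]. \tag{18.11} \]

Equivalently,

\[ \mathrm{Ex}[f(R)] = \sum_{r \in \mathrm{range}(R)} f(r) \Pr[R = r]. \tag{18.12} \]

For example, suppose that R is the value obtained by rolling a fair 6-sided die.
Then

\[ \mathrm{Ex}\left[\frac{1}{R}\right] = \frac{1}{1} \cdot \frac{1}{6} + \frac{1}{2} \cdot \frac{1}{6} + \frac{1}{3} \cdot \frac{1}{6} + \frac{1}{4} \cdot \frac{1}{6} + \frac{1}{5} \cdot \frac{1}{6} + \frac{1}{6} \cdot \frac{1}{6} = \frac{49}{120}. \]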
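A short computation confirms the value 49/120 and shows that the expectation of a function is generally not the function of the expectation. This sketch is illustrative; `expect_f` is a hypothetical helper that sums f(r) Pr[R = r] over a finite table of probabilities.

```python
from fractions import Fraction

def expect_f(pmf, f):
    """Ex[f(R)] as the sum of f(r) * Pr[R = r] over the range of R."""
    return sum(f(r) * p for r, p in pmf.items())

die = {r: Fraction(1, 6) for r in range(1, 7)}
ex_inv = expect_f(die, lambda r: Fraction(1, r))   # Ex[1/R]
ex_r = expect_f(die, lambda r: r)                  # Ex[R]
print(ex_inv)      # 49/120
print(1 / ex_r)    # 2/7
```

Here Ex[1/R] = 49/120 is noticeably larger than 1/Ex[R] = 2/7, so the two quantities should not be confused.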



18.2 Expected Returns in Gambling Games 

Some of the most interesting examples of expectation can be explained in terms of
gambling games. For straightforward games where you win $A with probability p
and you lose $B with probability 1 − p, it is easy to compute your expected return
or winnings. It is simply

\[ pA - (1 - p)B. \]

For example, if you are flipping a fair coin and you win $1 for heads and you lose $1
for tails, then your expected winnings are

\[ \frac{1}{2} \cdot 1 - \left( 1 - \frac{1}{2} \right) \cdot 1 = 0. \]

In such cases, the game is said to be fair since your expected return is zero.

Some gambling games are more complicated and thus more interesting. For 
example, consider the following game where the winners split a pot. This sort of 
game is representative of many poker games, betting pools, and lotteries. 

18.2.1 Splitting the Pot 

After your last encounter with biker dude, one thing led to another and you have
dropped out of school and become a Hell's Angel. It's late on a Friday night and,
feeling nostalgic for the old days, you drop by your old hangout, where you
encounter two of your former TAs, Eric and Nick. Eric and Nick propose that you
join them in a simple wager. Each player will put $2 on the bar and secretly write
"heads" or "tails" on their napkin. Then one player will flip a fair coin. The $6 on
the bar will then be divided equally among the players who correctly predicted the
outcome of the coin toss.

After your life-altering encounter with strange dice, you are more than a little
skeptical. So Eric and Nick agree to let you be the one to flip the coin. This
certainly seems fair. How can you lose?

But you have learned your lesson and so before agreeing, you go through the 
four-step method and write out the tree diagram to compute your expected return. 
The tree diagram is shown in Figure 18.1.

The "payoff" values in Figure 18.1 are computed by dividing the $6 pot³ among
those players who guessed correctly and then subtracting the $2 that you put into
the pot at the beginning. For example, if all three players guessed correctly, then
your payoff is $0, since you just get back your $2 wager. If you and Nick guess
correctly and Eric guessed wrong, then your payoff is

\[ \frac{6}{2} - 2 = 1. \]

In the case that everyone is wrong, you all agree to split the pot and so, again, your
payoff is zero.

To compute your expected return, you use Equation 18.1 in the definition of
expected value. This yields

\begin{align*}
\mathrm{Ex}[\text{payoff}] &= 0 \cdot \frac{1}{8} + 1 \cdot \frac{1}{8} + 1 \cdot \frac{1}{8} + 4 \cdot \frac{1}{8} \\
&\quad + (-2) \cdot \frac{1}{8} + (-2) \cdot \frac{1}{8} + (-2) \cdot \frac{1}{8} + 0 \cdot \frac{1}{8} \\
&= 0.
\end{align*}

³The money invested in a wager is commonly referred to as the pot.
you guess    Eric guesses    Nick guesses    your
right?       right?          right?          payoff    probability
yes          yes             yes              $0       1/8
yes          yes             no               $1       1/8
yes          no              yes              $1       1/8
yes          no              no               $4       1/8
no           yes             yes             -$2       1/8
no           yes             no              -$2       1/8
no           no              yes             -$2       1/8
no           no              no               $0       1/8

Every branch of the tree has probability 1/2.

Figure 18.1 The tree diagram for the game where three players each wager $2
and then guess the outcome of a fair coin toss. The winners split the pot.
This confirms that the game is fair. So, for old times' sake, you break your solemn
vow to never ever engage in strange gambling games. 

18.2.2 The Impact of Collusion 

Needless to say, things are not turning out well for you. The more times you play 
the game, the more money you seem to be losing. After 1000 wagers, you have lost 
over $500. As Nick and Eric are consoling you on your "bad luck," you do a back- 
of-the-napkin calculation using the bounds on the tails of the binomial distribution 
from Section 17.5 that suggests that the probability of losing $500 in 1000 wagers 
is less than the probability of a Vietnamese Monk waltzing in and handing you one 
of those golden disks. How can this be? 

It is possible that you are truly very very unlucky. But it is more likely that 
something is wrong with the tree diagram in Figure 18.1 and that "something" just 
might have something to do with the possibility that Nick and Eric are colluding
against you. 

To be sure, Nick and Eric can only guess the outcome of the coin toss with 
probability 1/2, but what if Nick and Eric always guess differently? In other words, 
what if Nick always guesses "tails" when Eric guesses "heads," and vice-versa? 
This would result in a slightly different tree diagram, as shown in Figure 18.2. 

The payoffs for each outcome are the same in Figures 18.1 and 18.2, but the 
probabilities of the outcomes are different. For example, it is no longer possible 
for all three players to guess correctly, since Nick and Eric are always guessing 
differently. More importantly, the outcome where your payoff is $4 is also no 
longer possible. Since Nick and Eric are always guessing differently, one of them 
will always get a share of the pot. As you might imagine, this is not good for you! 

When we use Equation 18.1 to compute your expected return in the collusion 
scenario, we find that 

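\begin{align*}
\mathrm{Ex}[\text{payoff}] &= 0 \cdot 0 + 1 \cdot \frac{1}{4} + 1 \cdot \frac{1}{4} + 4 \cdot 0 \\
&\quad + (-2) \cdot 0 + (-2) \cdot \frac{1}{4} + (-2) \cdot \frac{1}{4} + 0 \cdot 0 \\
&= -\frac{1}{2}.
\end{align*}

This is very bad indeed. By colluding, Nick and Eric have made it so that you
expect to lose $0.50 every time you play. No wonder you lost $500 over the course
of 1000 wagers.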
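Both expected returns, for the fair game and for the collusion scenario, can be recomputed by enumerating the eight outcomes. The sketch below is an illustration under the rules described in this section ($2 stakes, $6 pot split among correct guessers, pot returned if everyone is wrong); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def payoff(you, eric, nick):
    """Your net payoff given who guessed right (True) or wrong (False)."""
    winners = you + eric + nick
    if winners == 0:
        return Fraction(0)            # everyone wrong: the pot is split three ways
    return (Fraction(6, winners) if you else Fraction(0)) - 2

def expected_payoff(prob):
    """prob maps an outcome (you, eric, nick) to its probability."""
    return sum(payoff(*w) * prob(*w) for w in product([True, False], repeat=3))

def independent(you, eric, nick):
    """All three guesses are independent and right with probability 1/2."""
    return Fraction(1, 8)

def colluding(you, eric, nick):
    """Exactly one of Eric and Nick is right; you are right with probability 1/2."""
    return Fraction(1, 4) if eric != nick else Fraction(0)

print(expected_payoff(independent))  # 0
print(expected_payoff(colluding))    # -1/2
```

The payoff function is the same in both cases; only the probabilities assigned to the outcomes change.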

Maybe it would be a good idea to go back to school. Your Hell's Angels buds
may not be too happy that you just lost their $500.
you guess    Eric guesses    Nick guesses    your
right?       right?          right?          payoff    probability
yes          yes             yes              $0       0
yes          yes             no               $1       1/4
yes          no              yes              $1       1/4
yes          no              no               $4       0
no           yes             yes             -$2       0
no           yes             no              -$2       1/4
no           no              yes             -$2       1/4
no           no              no               $0       0

Your branches and Eric's branches each have probability 1/2. Nick guesses right
with probability 0 when Eric is right and with probability 1 when Eric is wrong.

Figure 18.2 The revised tree diagram reflecting the scenario where Nick always
guesses the opposite of Eric.
18.2.3 How to Win the Lottery

Similar opportunities to "collude" arise in many betting games. For example, con- 
sider the typical weekly football betting pool, where each participant wagers $10 
and the participants that pick the most games correctly split a large pot. The pool 
seems fair if you think of it as in Figure 18.1. But, in fact, if two or more players 
collude by guessing differently, they can get an "unfair" advantage at your expense! 

In some cases, the collusion is inadvertent and you can profit from it. For ex- 
ample, many years ago, a former MIT Professor of Mathematics named Herman 
Chernoff figured out a way to make money by playing the state lottery. This was 
surprising since state lotteries typically have very poor expected returns. That's be- 
cause the state usually takes a large share of the wagers before distributing the rest 
of the pot among the winners. Hence, anyone who buys a lottery ticket is expected 
to lose money. So how did Chernoff find a way to make money? It turned out to be 
easy! 

In a typical state lottery, 

• all players pay $1 to play and select 4 numbers from 1 to 36, 

• the state draws 4 numbers from 1 to 36 uniformly at random, 

• the state divides 1/2 of the money collected among the people who guessed
correctly and spends the other half redecorating the governor's residence. 

This is a lot like the game you played with Nick and Eric, except that there are 
more players and more choices. Chernoff discovered that a small set of numbers 
was selected by a large fraction of the population. Apparently many people think 
the same way; they pick the same numbers not on purpose as in the previous game 
with Nick and Eric, but based on Manny's batting average or today's date. 

It was as if the players were colluding to lose! If any one of them guessed 
correctly, then they'd have to split the pot with many other players. By selecting 
numbers uniformly at random, Chernoff was unlikely to get one of these favored 
sequences. So if he won, he'd likely get the whole pot! By analyzing actual state 
lottery data, he determined that he could win an average of 7 cents on the dollar. In
other words, his expected return was not −$0.50 as you might think, but +$0.07.⁴

Inadvertent collusion often arises in betting pools and is a phenomenon that you 
can take advantage of. For example, suppose you enter a Super Bowl betting pool 
where the goal is to get closest to the total number of points scored in the game. 
Also suppose that the average Super Bowl has a total of 30 points scored and that
everyone knows this. Then most people will guess around 30 points. Where should
you guess? Well, you should guess just outside of this range because you get to
cover a lot more ground and you don't share the pot if you win. Of course, if you
are in a pool with math students and they all know this strategy, then maybe you
should guess 30 points after all.

⁴Most lotteries now offer randomized tickets to help smooth out the distribution of
selected sequences.



18.3 Expectations of Sums 



18.3.1 Linearity of Expectation 

Expected values obey a simple, very helpful rule called Linearity of Expectation. 
Its simplest form says that the expected value of a sum of random variables is the 
sum of the expected values of the variables. 

Theorem 18.3.1. For any random variables R_1 and R_2,

\[ \mathrm{Ex}[R_1 + R_2] = \mathrm{Ex}[R_1] + \mathrm{Ex}[R_2]. \]

Proof. Let T ::= R_1 + R_2. The proof follows straightforwardly by rearranging
terms in Equation 18.1:

\begin{align*}
\mathrm{Ex}[T] &= \sum_{\omega \in S} T(\omega) \cdot \Pr[\omega] && \text{(Definition 18.1.1)} \\
&= \sum_{\omega \in S} (R_1(\omega) + R_2(\omega)) \cdot \Pr[\omega] && \text{(definition of } T \text{)} \\
&= \sum_{\omega \in S} R_1(\omega) \Pr[\omega] + \sum_{\omega \in S} R_2(\omega) \Pr[\omega] && \text{(rearranging terms)} \\
&= \mathrm{Ex}[R_1] + \mathrm{Ex}[R_2]. && \text{(Definition 18.1.1)}
\end{align*}
■

A small extension of this proof, which we leave to the reader, implies

Theorem 18.3.2. For random variables R_1, R_2 and constants a_1, a_2 ∈ ℝ,

\[ \mathrm{Ex}[a_1 R_1 + a_2 R_2] = a_1 \mathrm{Ex}[R_1] + a_2 \mathrm{Ex}[R_2]. \]

In other words, expectation is a linear function. A routine induction extends the
result to more than two variables:

Corollary 18.3.3 (Linearity of Expectation). For any random variables R_1, ..., R_k
and constants a_1, ..., a_k ∈ ℝ,

\[ \mathrm{Ex}\left[ \sum_{i=1}^{k} a_i R_i \right] = \sum_{i=1}^{k} a_i \mathrm{Ex}[R_i]. \]
The great thing about linearity of expectation is that no independence is required.
This is really useful, because dealing with independence is a pain, and we often
need to work with random variables that are not known to be independent.

As an example, let's compute the expected value of the sum of two fair dice. Let
the random variable R_1 be the number on the first die, and let R_2 be the number on
the second die. We observed earlier that the expected value of one die is 3.5. We
can find the expected value of the sum using linearity of expectation:

\[ \mathrm{Ex}[R_1 + R_2] = \mathrm{Ex}[R_1] + \mathrm{Ex}[R_2] = 3.5 + 3.5 = 7. \]

Notice that we did not have to assume that the two dice were independent. The 
expected sum of two dice is 7, even if they are glued together (provided each indi- 
vidual die remains fair after the gluing). Proving that this expected sum is 7 with a 
tree diagram would be a bother: there are 36 cases. And if we did not assume that 
the dice were independent, the job would be really tough! 
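The claim about glued dice can be checked on explicit joint distributions. The sketch below is illustrative: the two "glued" distributions are made-up examples in which each die is still fair on its own but the pair is completely dependent.

```python
from fractions import Fraction

def expected_sum(joint):
    """Ex[R1 + R2] for a joint pmf mapping (r1, r2) -> probability."""
    return sum((r1 + r2) * p for (r1, r2), p in joint.items())

faces = range(1, 7)
independent = {(a, b): Fraction(1, 36) for a in faces for b in faces}
glued_same = {(a, a): Fraction(1, 6) for a in faces}           # second die copies the first
glued_opposite = {(a, 7 - a): Fraction(1, 6) for a in faces}   # second die shows 7 - first

for joint in (independent, glued_same, glued_opposite):
    print(expected_sum(joint))   # 7 each time
```

The distribution of the sum differs greatly across the three cases (it is always exactly 7 in the last one), yet the expectation is the same.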

18.3.2 Sums of Indicator Random Variables 

Linearity of expectation is especially useful when you have a sum of indicator
random variables. As an example, suppose there is a dinner party where n men check
their hats. The hats are mixed up during dinner, so that afterward each man receives
a random hat. In particular, each man gets his own hat with probability 1/n. What
is the expected number of men who get their own hat?

Letting G be the number of men that get their own hat, we want to find the
expectation of G. But all we know about G is that the probability that a man gets
his own hat back is 1/n. There are many different probability distributions of hat
permutations with this property, so we don't know enough about the distribution
of G to calculate its expectation directly. But linearity of expectation makes the
problem really easy.

The trick⁵ is to express G as a sum of indicator variables. In particular, let G_i be
an indicator for the event that the ith man gets his own hat. That is, G_i = 1 if the
ith man gets his own hat, and G_i = 0 otherwise. The number of men that get their
own hat is then the sum of these indicator random variables:

\[ G = G_1 + G_2 + \cdots + G_n. \tag{18.13} \]

These indicator variables are not mutually independent. For example, if n − 1 men
all get their own hats, then the last man is certain to receive his own hat. But, since
we plan to use linearity of expectation, we don't have to worry about independence!

⁵We are going to use this trick a lot so it is important to understand it.
Since G_i is an indicator random variable, we know from Lemma 18.1.3 that

\[ \mathrm{Ex}[G_i] = \Pr[G_i = 1] = \frac{1}{n}. \tag{18.14} \]

By Linearity of Expectation and Equation 18.13, this means that

\begin{align*}
\mathrm{Ex}[G] &= \mathrm{Ex}[G_1 + G_2 + \cdots + G_n] \\
&= \mathrm{Ex}[G_1] + \mathrm{Ex}[G_2] + \cdots + \mathrm{Ex}[G_n] \\
&= \underbrace{\frac{1}{n} + \frac{1}{n} + \cdots + \frac{1}{n}}_{n} \\
&= 1.
\end{align*}

So even though we don't know much about how hats are scrambled, we've figured 
out that on average, just one man gets his own hat back! 
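For one particular way of scrambling hats, the result can be verified by brute force. The sketch below assumes that all n! assignments of hats are equally likely, which is only one of the many distributions the argument covers; `expected_own_hats` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import permutations

def expected_own_hats(n):
    """Average number of men who get their own hat over all n! equally likely assignments."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for man, hat in enumerate(perm) if man == hat) for perm in perms)
    return Fraction(total, len(perms))

print(all(expected_own_hats(n) == 1 for n in range(1, 7)))  # True
```

Enumeration becomes infeasible quickly as n grows, whereas the indicator-variable argument gives the answer for every n at once.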

More generally, Linearity of Expectation provides a very good method for com- 
puting the expected number of events that will happen. 

Theorem 18.3.4. Given any collection of n events A_1, A_2, ..., A_n ⊆ S, the
expected number of events that will occur is

\[ \sum_{i=1}^{n} \Pr[A_i]. \]

For example, A_i could be the event that the ith man gets the right hat back. But
in general, it could be any subset of the sample space, and we are asking for the
expected number of events that will contain a random sample point.

Proof. Define R_i to be the indicator random variable for A_i, where R_i(ω) = 1 if
ω ∈ A_i and R_i(ω) = 0 if ω ∉ A_i. Let R = R_1 + R_2 + ··· + R_n. Then

\begin{align*}
\mathrm{Ex}[R] &= \sum_{i=1}^{n} \mathrm{Ex}[R_i] && \text{(by Linearity of Expectation)} \\
&= \sum_{i=1}^{n} \Pr[R_i = 1] && \text{(by Lemma 18.1.3)} \\
&= \sum_{i=1}^{n} \sum_{\omega \in A_i} \Pr[\omega] && \text{(definition of indicator variable)} \\
&= \sum_{i=1}^{n} \Pr[A_i].
\end{align*}
■
So whenever you are asked for the expected number of events that occur, all you
have to do is sum the probabilities that each event occurs. Independence is not 
needed. 

18.3.3 Expectation of a Binomial Distribution 

Suppose that we independently flip n biased coins, each with probability p of
coming up heads. What is the expected number of heads?

Let J be the random variable denoting the number of heads. Then J has a
binomial distribution with parameters n, p, and

\[ \Pr[J = k] = \binom{n}{k} p^k (1 - p)^{n-k}. \]

Applying Equation 18.2, this means that

\begin{align*}
\mathrm{Ex}[J] &= \sum_{k=0}^{n} k \Pr[J = k] \\
&= \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k}. \tag{18.15}
\end{align*}



Ouch! This is one nasty looking sum. Let's try another approach. 

Since we have just learned about linearity of expectation for sums of indicator 
random variables, maybe Theorem 18.3.4 will be helpful. But how do we express J 
as a sum of indicator random variables? It turns out to be easy. Let $J_i$ be the
indicator random variable for the $i$th coin. In particular, define



\[
J_i =
\begin{cases}
1 & \text{if the $i$th coin is heads} \\
0 & \text{if the $i$th coin is tails.}
\end{cases}
\]



Then the number of heads is simply 

\[ J = J_1 + J_2 + \cdots + J_n. \]

By Theorem 18.3.4, 



\begin{align*}
\operatorname{Ex}[J] &= \sum_{i=1}^{n} \Pr[J_i = 1] \\
&= np. \tag{18.16}
\end{align*}
That really was easy. If we flip n mutually independent coins, we expect to get
pn heads. Hence the expected value of a binomial distribution with parameters n 
and p is simply pn. 

But what if the coins are not mutually independent? It doesn't matter: the
answer is still pn because Linearity of Expectation and Theorem 18.3.4 do not
assume any independence.
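
An extreme case makes the point concrete. Suppose, as a hypothetical illustration, that the n coins are glued together so that they always land the same way: all heads with probability p, all tails otherwise. The sketch below (function names are illustrative) compares the expected number of heads computed directly with the sum of the individual head probabilities.

```python
from fractions import Fraction


def expected_heads_glued(n, p):
    """All n coins always agree: n heads with probability p, otherwise 0 heads."""
    return n * p + 0 * (1 - p)


def expected_heads_by_indicators(n, p):
    """Sum of Pr[coin i is heads] over the n coins, as in Theorem 18.3.4."""
    return sum(p for _ in range(n))


n, p = 10, Fraction(3, 10)
print(expected_heads_glued(n, p), expected_heads_by_indicators(n, p))  # 3 3
```

The coins here are as far from independent as possible, yet the expectation is still pn.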

If you are not yet convinced that Linearity of Expectation and Theorem 18.3.4 
are powerful tools, consider this: without even trying, we have used them to prove 
a very complicated identity, namely

\[ \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k} = pn. \]

If you are still not convinced, then take a look at the next problem. 
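
The identity obtained by equating the sum in Equation 18.15 with the closed form in Equation 18.16 can also be spot-checked numerically. The sketch below uses exact rational arithmetic; the function name and the sample values of n and p are arbitrary choices for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_mean_by_definition(n, p):
    """Evaluate the sum over k of k * C(n, k) * p^k * (1 - p)^(n - k)."""
    return sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))


n, p = 10, Fraction(3, 10)
print(binomial_mean_by_definition(n, p), n * p)  # 3 3
```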
18.3.4 The Coupon Collector Problem 

Every time we purchase a kid's meal at Taco Bell, we are graciously presented with 
a miniature "Racin' Rocket" car together with a launching device which enables us 
to project our new vehicle across any tabletop or smooth floor at high velocity. 
Truly, our delight knows no bounds.

There are n different types of Racin' Rocket cars (blue, green, red, gray, etc.). 
The type of car awarded to us each day by the kind woman at the Taco Bell reg- 
ister appears to be selected uniformly and independently at random. What is the 
expected number of kid's meals that we must purchase in order to acquire at least 
one of each type of Racin' Rocket car? 

The same mathematical question shows up in many guises: for example, what 
is the expected number of people you must poll in order to find at least one person 
with each possible birthday? Here, instead of collecting Racin' Rocket cars, you're 
collecting birthdays. The general question is commonly called the coupon collector 
problem after yet another interpretation. 

A clever application of linearity of expectation leads to a simple solution to the 
coupon collector problem. Suppose there are five different types of Racin' Rocket 
cars, and we receive this sequence: 

blue green green red blue orange blue orange gray. 

Let's partition the sequence into 5 segments: 

blue | green | green red | blue orange | blue orange gray.





(Footnote to Section 18.3.3: the identity follows by combining Equations 18.15 and 18.16.)
The rule is that a segment ends whenever we get a new kind of car. For example, the
middle segment ends when we get a red car for the first time. In this way, we can
break the problem of collecting every type of car into stages. Then we can analyze
each stage individually and assemble the results using linearity of expectation.
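
The segmentation rule can be sketched in a few lines. The function name `split_into_segments` is hypothetical; it closes a segment each time a previously unseen type appears and drops any trailing repeats that come after the last new type.

```python
def split_into_segments(cars):
    """Split a sequence of cars into segments, each ending at the first
    appearance of a new type. Trailing repeats after the last new type
    do not form a complete segment and are dropped."""
    segments, current, seen = [], [], set()
    for car in cars:
        current.append(car)
        if car not in seen:
            seen.add(car)
            segments.append(current)
            current = []
    return segments


sequence = "blue green green red blue orange blue orange gray".split()
for segment in split_into_segments(sequence):
    print(segment)
```

On the sequence above this yields five segments, with the middle one being green, red.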

Let's return to the general case where we're collecting n Racin' Rockets. Let
$X_k$ be the length of the $k$th segment. The total number of kid's meals we must
purchase to get all n Racin' Rockets is the sum of the lengths of all these segments:

\[ T = X_0 + X_1 + \cdots + X_{n-1}. \]

Now let's focus our attention on $X_k$, the length of the $k$th segment. At the
beginning of segment k, we have k different types of car, and the segment ends
when we acquire a new type. When we own k types, each kid's meal contains a
type that we already have with probability k/n. Therefore, each meal contains a
new type of car with probability 1 - k/n = (n - k)/n. Thus, the expected number
of meals until we get a new kind of car is n/(n - k) by the "mean time to failure"
formula in Equation 18.4. This means that

\[ \operatorname{Ex}[X_k] = \frac{n}{n-k}. \]



Linearity of expectation, together with this observation, solves the coupon col- 
lector problem: 

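\begin{align*}
\operatorname{Ex}[T] &= \operatorname{Ex}[X_0 + X_1 + \cdots + X_{n-1}] \\
&= \operatorname{Ex}[X_0] + \operatorname{Ex}[X_1] + \cdots + \operatorname{Ex}[X_{n-1}] \\
&= \frac{n}{n-0} + \frac{n}{n-1} + \cdots + \frac{n}{3} + \frac{n}{2} + \frac{n}{1} \\
&= n\left(\frac{1}{n} + \frac{1}{n-1} + \cdots + \frac{1}{3} + \frac{1}{2} + \frac{1}{1}\right) \\
&= n\left(\frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n-1} + \frac{1}{n}\right) \\
&= nH_n \tag{18.17} \\
&\sim n \ln n. \tag{18.18}
\end{align*}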

Wow! It's those Harmonic Numbers again! 

We can use Equation 18.17 to answer some concrete questions. For example, the
expected number of die rolls required to see every number from 1 to 6 is:

\[ 6H_6 = 14.7. \]
And the expected number of people you must poll to find at least one person with
each possible birthday is:

\[ 365H_{365} = 2364.6\ldots \]
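
These two figures are easy to reproduce. The following sketch evaluates n times the nth Harmonic number exactly with rational arithmetic; the function name `expected_purchases` is an illustrative choice, and the last line prints the n ln n approximation for comparison.

```python
from fractions import Fraction
import math


def expected_purchases(n):
    """Expected number of draws to collect all n coupon types: n * H_n."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))


print(float(expected_purchases(6)))     # 14.7
print(float(expected_purchases(365)))   # roughly 2364.6
print(365 * math.log(365))              # the n ln n approximation, noticeably smaller
```

The gap between the last two lines is a reminder that n ln n captures only the leading asymptotic behavior of the exact value.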


18.3.5 Infinite Sums 








Linearity of expectation also works for an infinite number of random variables
provided that the variables satisfy some stringent absolute convergence criteria.


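Theorem 18.3.5 (Linearity of Expectation). Let $R_0, R_1, \ldots$ be random variables
such that

\[ \sum_{i=0}^{\infty} \operatorname{Ex}[\,|R_i|\,] \]

converges. Then

\[ \operatorname{Ex}\left[\sum_{i=0}^{\infty} R_i\right] = \sum_{i=0}^{\infty} \operatorname{Ex}[R_i]. \]
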


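Proof. Let $T ::= \sum_{i=0}^{\infty} R_i$.

We leave it to the reader to verify that, under the given convergence hypothesis,
all the sums in the following derivation are absolutely convergent, which justifies
rearranging them as follows:

\begin{align*}
\sum_{i=0}^{\infty} \operatorname{Ex}[R_i]
&= \sum_{i=0}^{\infty} \sum_{s \in S} R_i(s) \cdot \Pr[s] && \text{(Def. 18.1.1)} \\
&= \sum_{s \in S} \sum_{i=0}^{\infty} R_i(s) \cdot \Pr[s] && \text{(exchanging order of summation)} \\
&= \sum_{s \in S} \left[\sum_{i=0}^{\infty} R_i(s)\right] \cdot \Pr[s] && \text{(factoring out $\Pr[s]$)} \\
&= \sum_{s \in S} T(s) \cdot \Pr[s] && \text{(Def. of $T$)} \\
&= \operatorname{Ex}[T] && \text{(Def. 18.1.1)} \\
&= \operatorname{Ex}\left[\sum_{i=0}^{\infty} R_i\right]. && \text{(Def. of $T$)} \quad \blacksquare
\end{align*}
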












"mcs-ftr' — 2010/9/8 — 0:40 — page 490 — #496 



Chapter 18 Expectation 



18.4 Expectations of Products 



While the expectation of a sum is the sum of the expectations, the same is usually
not true for products. For example, suppose that we roll a fair 6-sided die and
denote the outcome with the random variable R. Does $\operatorname{Ex}[R \cdot R] = \operatorname{Ex}[R] \cdot \operatorname{Ex}[R]$?

We know that $\operatorname{Ex}[R] = 3\tfrac{1}{2}$ and thus $\operatorname{Ex}[R]^2 = 12\tfrac{1}{4}$. Let's compute $\operatorname{Ex}[R^2]$ to
see if we get the same result.

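\begin{align*}
\operatorname{Ex}[R^2] &= \sum_{w \in S} R^2(w) \Pr[w] \\
&= \sum_{i=1}^{6} i^2 \cdot \Pr[R = i] \\
&= \frac{1^2}{6} + \frac{2^2}{6} + \frac{3^2}{6} + \frac{4^2}{6} + \frac{5^2}{6} + \frac{6^2}{6} \\
&= 15\tfrac{1}{6} \\
&\neq 12\tfrac{1}{4}.
\end{align*}

Hence,

\[ \operatorname{Ex}[R \cdot R] \neq \operatorname{Ex}[R] \cdot \operatorname{Ex}[R], \]

and so the expectation of a product is not always equal to the product of the
expectations.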

There is a special case when such a relationship does hold however; namely, 
when the random variables in the product are independent. 

Theorem 18.4.1. For any two independent random variables $R_1$, $R_2$,

\[ \operatorname{Ex}[R_1 \cdot R_2] = \operatorname{Ex}[R_1] \cdot \operatorname{Ex}[R_2]. \]

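Proof. The event $[R_1 \cdot R_2 = r]$ can be split up into events of the form
$[R_1 = r_1 \text{ and } R_2 = r_2]$ where $r_1 \cdot r_2 = r$. So

\begin{align*}
\operatorname{Ex}[R_1 \cdot R_2]
&= \sum_{r \in \operatorname{range}(R_1 \cdot R_2)} r \cdot \Pr[R_1 \cdot R_2 = r] && \text{(Theorem 18.1.4)} \\
&= \sum_{r_1 \in \operatorname{range}(R_1)} \sum_{r_2 \in \operatorname{range}(R_2)} r_1 r_2 \cdot \Pr[R_1 = r_1 \text{ and } R_2 = r_2] \\
&= \sum_{r_1 \in \operatorname{range}(R_1)} \sum_{r_2 \in \operatorname{range}(R_2)} r_1 r_2 \cdot \Pr[R_1 = r_1] \cdot \Pr[R_2 = r_2] && \text{(independence of $R_1$, $R_2$)} \\
&= \sum_{r_1 \in \operatorname{range}(R_1)} \left( r_1 \Pr[R_1 = r_1] \sum_{r_2 \in \operatorname{range}(R_2)} r_2 \Pr[R_2 = r_2] \right) && \text{(factor out $r_1 \Pr[R_1 = r_1]$)} \\
&= \sum_{r_1 \in \operatorname{range}(R_1)} r_1 \Pr[R_1 = r_1] \cdot \operatorname{Ex}[R_2] && \text{(Theorem 18.1.4)} \\
&= \operatorname{Ex}[R_2] \left( \sum_{r_1 \in \operatorname{range}(R_1)} r_1 \Pr[R_1 = r_1] \right) && \text{(factor out $\operatorname{Ex}[R_2]$)} \\
&= \operatorname{Ex}[R_2] \cdot \operatorname{Ex}[R_1]. && \text{(Theorem 18.1.4)} \quad \blacksquare
\end{align*}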



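For example, let $R_1$ and $R_2$ be random variables denoting the result of rolling
two independent and fair 6-sided dice. Then

\[ \operatorname{Ex}[R_1 \cdot R_2] = \operatorname{Ex}[R_1] \operatorname{Ex}[R_2] = 3\tfrac{1}{2} \cdot 3\tfrac{1}{2} = 12\tfrac{1}{4}. \]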
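
Both dice calculations in this section can be reproduced by brute-force enumeration. The sketch below uses exact fractions; the variable names are illustrative. It contrasts the square of a single roll (where the two factors are the same variable, hence dependent) with the product of two independent rolls.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)
PR = Fraction(1, 6)   # probability of each face of one fair die

mean = sum(PR * r for r in FACES)              # Ex[R]   = 7/2
mean_square = sum(PR * r * r for r in FACES)   # Ex[R^2] = 91/6

# Two independent dice: each of the 36 ordered pairs has probability 1/36.
mean_product = sum(PR * PR * r1 * r2 for r1, r2 in product(FACES, repeat=2))

print(mean_square, mean * mean)    # 91/6 49/4  (not equal)
print(mean_product, mean * mean)   # 49/4 49/4  (equal)
```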

Theorem 18.4.1 extends by induction to a collection of mutually independent 
random variables. 

Corollary 18.4.2. If random variables $R_1, R_2, \ldots, R_k$ are mutually independent,
then

\[ \operatorname{Ex}\left[\prod_{i=1}^{k} R_i\right] = \prod_{i=1}^{k} \operatorname{Ex}[R_i]. \]
18.5 Expectations of Quotients

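If S and T are random variables, we know from Linearity of Expectation that

\[ \operatorname{Ex}[S + T] = \operatorname{Ex}[S] + \operatorname{Ex}[T]. \]

If S and T are independent, we know from Theorem 18.4.1 that

\[ \operatorname{Ex}[ST] = \operatorname{Ex}[S] \operatorname{Ex}[T]. \]

Is it also true that

\[ \operatorname{Ex}[S/T] = \operatorname{Ex}[S] / \operatorname{Ex}[T]? \tag{18.19} \]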



Of course, we have to worry about the situation when $\operatorname{Ex}[T] = 0$, but what if we
assume that T is always positive? As we will soon see, Equation 18.19 is usually
not true, but let's see if we can prove it anyway.

False Claim 18.5.1. If S and T are independent random variables with T > 0,
then

\[ \operatorname{Ex}[S/T] = \operatorname{Ex}[S] / \operatorname{Ex}[T]. \tag{18.20} \]



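Bogus proof.

\begin{align*}
\operatorname{Ex}\left[\frac{S}{T}\right] &= \operatorname{Ex}\left[S \cdot \frac{1}{T}\right] \\
&= \operatorname{Ex}[S] \cdot \operatorname{Ex}\left[\frac{1}{T}\right] && \text{(independence of $S$ and $T$)} \tag{18.21} \\
&= \operatorname{Ex}[S] \cdot \frac{1}{\operatorname{Ex}[T]} \tag{18.22} \\
&= \frac{\operatorname{Ex}[S]}{\operatorname{Ex}[T]}.
\end{align*}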

Note that line 18.21 uses the fact that if S and T are independent, then so are
S and 1/T. This holds because functions of independent random variables are
independent. It is a fact that needs proof, which we will leave to the reader, but it
is not the bug. The bug is in line (18.22), which assumes

False Claim 18.5.2.

\[ \operatorname{Ex}\left[\frac{1}{T}\right] = \frac{1}{\operatorname{Ex}[T]}. \]

Benchmark          RISC    CISC    CISC/RISC
E-string search     150     120    0.8
F-bit test          120     180    1.5
Ackerman            150     300    2.0
Rec 2-sort         2800    1400    0.5
Average                            1.2

Table 18.1 Sample program lengths for benchmark problems using RISC and
CISC compilers.



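Here is a counterexample. Define T so that

\[ \Pr[T = 1] = \frac{1}{2} \quad \text{and} \quad \Pr[T = 2] = \frac{1}{2}. \]

Then

\[ \operatorname{Ex}[T] = 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{2} = \frac{3}{2} \]

and

\[ \frac{1}{\operatorname{Ex}[T]} = \frac{2}{3} \]

and

\[ \operatorname{Ex}\left[\frac{1}{T}\right] = \frac{1}{1} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2} = \frac{3}{4} \neq \frac{1}{\operatorname{Ex}[T]}. \]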

This means that Claim 18.5.1 is also false, since we could define S = 1 with
probability 1. In fact, both Claims 18.5.1 and 18.5.2 are untrue for almost all choices of
S and T. Unfortunately, the fact that they are false does not keep them from being
widely used in practice! Let's see an example.
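
The counterexample takes only a few lines to check with exact fractions. In this sketch the dictionary `dist_T` (an illustrative name) maps each value of T to its probability.

```python
from fractions import Fraction

# T is 1 or 2, each with probability 1/2.
dist_T = {1: Fraction(1, 2), 2: Fraction(1, 2)}

ex_T = sum(t * pr for t, pr in dist_T.items())                   # Ex[T]   = 3/2
ex_inv_T = sum(Fraction(1, t) * pr for t, pr in dist_T.items())  # Ex[1/T] = 3/4

print(1 / ex_T, ex_inv_T)  # 2/3 3/4
```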

18.5.1 A RISC Paradox 

The data in Table 18.1 is representative of data in a paper by some famous pro- 
fessors. They wanted to show that programs on a RISC processor are generally 
shorter than programs on a CISC processor. For this purpose, they applied a RISC 
compiler and then a CISC compiler to some benchmark source programs and made 
a table of compiled program lengths. 

Each row in Table 18.1 contains the data for one benchmark. The numbers in 
the second and third columns are program lengths for each type of compiler. The 
fourth column contains the ratio of the CISC program length to the RISC program 
length. Averaging this ratio over all benchmarks gives the value 1.2 in the lower
right. The conclusion is that CISC programs are 20% longer on average.
Benchmark          RISC    CISC    RISC/CISC
E-string search     150     120    1.25
F-bit test          120     180    0.67
Ackerman            150     300    0.5
Rec 2-sort         2800    1400    2.0
Average                            1.1

Table 18.2 The same data as in Table 18.1, but with the opposite ratio in the last
column.

However, some critics of their paper took the same data and argued this way: 
redo the final column, taking the other ratio, RISC/CISC instead of CISC/RISC, as 
shown in Table 18.2. 

From Table 18.2, we would conclude that RISC programs are 10% longer than
CISC programs on average! We are using the same reasoning as in the paper, so
this conclusion is equally justifiable, yet the result is opposite. What is going on?

A Probabilistic Interpretation 

To resolve these contradictory conclusions, we can model the RISC vs. CISC de- 
bate with the machinery of probability theory. 

Let the sample space be the set of benchmark programs. Let the random variable 
R be the length of the compiled RISC program, and let the random variable C be 
the length of the compiled CISC program. We would like to compare the average 
length $\operatorname{Ex}[R]$ of a RISC program to the average length $\operatorname{Ex}[C]$ of a CISC program.

To compare average program lengths, we must assign a probability to each sample
point; in effect, this assigns a "weight" to each benchmark. One might like
to weight benchmarks based on how frequently similar programs arise in practice.
Lacking such data, however, we will assign all benchmarks equal weight; that is, 
our sample space is uniform. 

In terms of our probability model, the paper computes C/R for each sample
point, and then averages to obtain $\operatorname{Ex}[C/R] = 1.2$. This much is correct. The
authors then conclude that CISC programs are 20% longer on average; that is, they
conclude that $\operatorname{Ex}[C] = 1.2 \operatorname{Ex}[R]$. Therein lies the problem. The authors have
implicitly used False Claim 18.5.1 to assume that $\operatorname{Ex}[C/R] = \operatorname{Ex}[C]/\operatorname{Ex}[R]$. By
using the same false logic, the critics can arrive at the opposite conclusion; namely,
that RISC programs are 10% longer on average.
The Proper Quotient



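We can compute $\operatorname{Ex}[R]$ and $\operatorname{Ex}[C]$ as follows:

\begin{align*}
\operatorname{Ex}[R] &= \sum_{i \in \operatorname{Range}(R)} i \cdot \Pr[R = i] \\
&= \frac{150}{4} + \frac{120}{4} + \frac{150}{4} + \frac{2800}{4} \\
&= 805, \\
\operatorname{Ex}[C] &= \sum_{i \in \operatorname{Range}(C)} i \cdot \Pr[C = i] \\
&= \frac{120}{4} + \frac{180}{4} + \frac{300}{4} + \frac{1400}{4} \\
&= 500.
\end{align*}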




Now since $\operatorname{Ex}[R]/\operatorname{Ex}[C] = 1.61$, we conclude that the average RISC program
is 61% longer than the average CISC program. This is a third answer, completely
different from the other two! Furthermore, this answer makes RISC look really
bad in terms of code length. This one is the correct conclusion, under our
assumption that the benchmarks deserve equal weight. Neither of the earlier results was
correct, which is not surprising since both were based on the same False Claim.
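
All three summary figures can be recomputed from the benchmark data in Table 18.1. The sketch below assumes equal weight for each benchmark, as in the text; the variable names are illustrative.

```python
from fractions import Fraction

# (RISC length, CISC length) for each benchmark in Table 18.1.
benchmarks = {
    "E-string search": (150, 120),
    "F-bit test": (120, 180),
    "Ackerman": (150, 300),
    "Rec 2-sort": (2800, 1400),
}
n = len(benchmarks)

# Averages of ratios: what the paper and its critics computed.
avg_cisc_over_risc = sum(Fraction(c, r) for r, c in benchmarks.values()) / n
avg_risc_over_cisc = sum(Fraction(r, c) for r, c in benchmarks.values()) / n

# Ratio of averages: Ex[R] / Ex[C].
ex_r = Fraction(sum(r for r, _ in benchmarks.values()), n)
ex_c = Fraction(sum(c for _, c in benchmarks.values()), n)

print(float(avg_cisc_over_risc))   # 1.2
print(float(avg_risc_over_cisc))   # about 1.10
print(float(ex_r / ex_c))          # 1.61
```

Both averaged ratios exceed 1, which is exactly why each side could claim the other architecture produces longer programs.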

A Simpler Example 

The source of the problem is clearer in the following, simpler example. Suppose
the data were as follows.

Benchmark    Processor A    Processor B    B/A     A/B
Problem 1    2              1              1/2     2
Problem 2    1              2              2       1/2
Average                                    1.25    1.25

Now the data for the processors A and B is exactly symmetric; the two
processors are equivalent. Yet, from the third column we would conclude that Processor B
programs are 25% longer on average, and from the fourth column we would
conclude that Processor A programs are 25% longer on average. Both conclusions are
obviously wrong.

The moral is that one must be very careful in summarizing data: we must not
take an average of ratios blindly!